AWS EC2 Discovery: discover in all enabled regions#61337
AWS EC2 Discovery: discover in all enabled regions#61337marcoandredinis merged 7 commits intomasterfrom
Conversation
401c3b8 to
1c46228
Compare
| return trace.BadParameter("discovery service requires at least one region") | ||
| m.Regions = []string{Wildcard} | ||
| } | ||
|
|
||
| for _, region := range m.Regions { | ||
| if region == Wildcard { | ||
| if len(m.Regions) > 1 { | ||
| return trace.BadParameter("when using %q as region, no other regions can be specified", Wildcard) | ||
| } | ||
|
|
||
| // Only EC2 supports discovering regions. | ||
| // TODO(marco): add support for other resources types. | ||
| if len(m.Types) > 1 || m.Types[0] != AWSMatcherEC2 { | ||
| return trace.BadParameter("only EC2 resource discovery supports discovering all regions, " + | ||
| "enumerate all the regions or create a separate matcher for discovering EC2 resources") | ||
| } | ||
| break | ||
| } | ||
|
|
There was a problem hiding this comment.
What implications does changing the CheckAndSetDefaults behavior have on backward compatibility?
There was a problem hiding this comment.
This change relaxes the validation: * is now a valid value for the regions array.
Upgrading
Not an issue, because everything that was valid before, is also valid now.
Downgrading
Setting a DiscoveryConfig with a * region, and then downgrading will emit a warning log:
WARN skipping resource due to unmarshal error error:[
Mixed: Auth/Proxy vX, DiscoveryService vX-1 (X-1 denotes the previous version, not necessarily the previous major)
This will result in the same behavior as above (warn log message)
So, in both situations the DiscoveryConfig is ignored.
There was a problem hiding this comment.
If the discovery config is ignored nothing new will be discovered until the * is removed or the discovery service is upgraded correct?
There was a problem hiding this comment.
Yes.
Upgrade, remove * and be explicit about which regions they want to discover.
There was a problem hiding this comment.
Would this also render the discovery service cache permanently unhealthy?
There was a problem hiding this comment.
Another option is to "downgrade" the resource from the Auth server for older clients. I'm not quite sure what we would substitute the wildcard with. I don't know that we want Auth trying to enumerate all regions.
There was a problem hiding this comment.
Not only we don't want to, Auth might not even have access to call the API.
We could set the region to aws-global sentinel value when the client is less than vXYZ. The cache would become healthy on the outdated Discovery Service client.
We would get the error down the line (eg, when trying to call ec:DescribeInstances)
So we would still get a failure, but just for that single DiscoveryConfig resource.
There was a problem hiding this comment.
I think this last option makes sense, you cannot use new features on older versions, but this should not result in the teleport process being completely broken (e.g. if I run both ssh and discovery service on the host, ssh should continue to work
There was a problem hiding this comment.
I think this last option makes sense
Are you referring to this option?
Detecting old clients and sending the aws-global region when the wildcard is set
I'm not only concerned about running two separate services, but having two discovery configs being consumed by a single Discovery Service.
If one of them has the regions: ["*"], both set of matchers will fail.
There was a problem hiding this comment.
Just implemented and tested that approach.
When clients are older than 18.4.2, it will replace the "*" with aws-global, which ensures the cache becomes healthy.
It does not mean it can discover the regions, but at least it's a soft error which does not impact the other DiscoveryConfigs being processed, neither other teleport services (eg, SSH Service).
1c46228 to
86f627d
Compare
49e8d69 to
066cc98
Compare
|
@r0mant @zmb3 @hugoShaka @tigrato @strideynet @bernardjkim @rosstimothy Can you please take a look when you get a chance? |
| return trace.BadParameter("discovery service requires at least one region") | ||
| m.Regions = []string{Wildcard} |
There was a problem hiding this comment.
Should we still require a region to be specified instead of defaulting to wildcard?
There was a problem hiding this comment.
From UX pov, I think the less options you need to set the better.
We are defaulting the tags matcher here, so that when nothing is set it will default to wildcard:wildcard.
I was following what we did for the tags, that's why I added it for the region.
I can remove the m.Regions = []string{Wildcard} when it's not set, and treat the empty regions list as fetch everything in the discovery service.
As an example, I think both configs should result in the same behavior:
aws:
- integration: teleportdev
tags:
"*": "*"
types:
- ec2
regions: ["*"]aws:
- integration: teleportdev
types:
- ec2There was a problem hiding this comment.
Can this have performance/stability implications? How bad is it if you run the discovery service on many regions with thousands of resources?
There was a problem hiding this comment.
I still think it might be best to explicitly reject the config when no regions are specified. Consider an example from the teleport file config: we enable Auth and Proxy by default to make UX easy, but that in reality is a footgun and has caused customers to unknowingly have hundreds of Auth and Proxy services because they didn't explicitly disable them in their agent configs.
We could update the error message to hint that wildcards are an available option.
There was a problem hiding this comment.
I don't think that is a fair comparison.
I see regions as a filter, so if I don't want to filter, I just expect the filter to not be there.
Another data point is that:
- azure and gcp matchers support empty list of regions, and that translates to matching in all regions / locations
- I see regions as a filter, same way we have for labels, which also translate empty map into
"*":"*"
I'll still revert that check, and focus on the cache issue instead.
| return trace.BadParameter("discovery service requires at least one region") | ||
| m.Regions = []string{Wildcard} | ||
| } | ||
|
|
||
| for _, region := range m.Regions { | ||
| if region == Wildcard { | ||
| if len(m.Regions) > 1 { | ||
| return trace.BadParameter("when using %q as region, no other regions can be specified", Wildcard) | ||
| } | ||
|
|
||
| // Only EC2 supports discovering regions. | ||
| // TODO(marco): add support for other resources types. | ||
| if len(m.Types) > 1 || m.Types[0] != AWSMatcherEC2 { | ||
| return trace.BadParameter("only EC2 resource discovery supports discovering all regions, " + | ||
| "enumerate all the regions or create a separate matcher for discovering EC2 resources") | ||
| } | ||
| break | ||
| } | ||
|
|
There was a problem hiding this comment.
Would this also render the discovery service cache permanently unhealthy?
| } | ||
|
|
||
| // GetInstances fetches all EC2 instances matching configured filters. | ||
| func (f *ec2InstanceFetcher) GetInstances(ctx context.Context, rotation bool) ([]Instances, error) { |
There was a problem hiding this comment.
Disclaimer: Not something that needs to be addressed in this PR but something we should consider going forward to allow the discovery service to scale. This API could lead to OOMing the discovery service if a user has a massive amount of matching instances. It doesn't look like the consumer of these instances requires the entire set to perform its duties. If that is the case we could leverage iterators here to handle a page of instances at a time on demand.
| func (f *ec2InstanceFetcher) GetInstances(ctx context.Context, rotation bool) ([]Instances, error) { | |
| func (f *ec2InstanceFetcher) Instances(ctx context.Context, rotation bool) iter.Seq2[[]Instances, error] |
There was a problem hiding this comment.
Yeah, we probably need to change this on all 3 implementations - aws, gcp and azure.
066cc98 to
a16dba6
Compare
hugoShaka
left a comment
There was a problem hiding this comment.
lgtm once the downgrade scenario doesn't break the cache
| return trace.BadParameter("discovery service requires at least one region") | ||
| m.Regions = []string{Wildcard} |
There was a problem hiding this comment.
Can this have performance/stability implications? How bad is it if you run the discovery service on many regions with thousands of resources?
| return trace.BadParameter("discovery service requires at least one region") | ||
| m.Regions = []string{Wildcard} | ||
| } | ||
|
|
||
| for _, region := range m.Regions { | ||
| if region == Wildcard { | ||
| if len(m.Regions) > 1 { | ||
| return trace.BadParameter("when using %q as region, no other regions can be specified", Wildcard) | ||
| } | ||
|
|
||
| // Only EC2 supports discovering regions. | ||
| // TODO(marco): add support for other resources types. | ||
| if len(m.Types) > 1 || m.Types[0] != AWSMatcherEC2 { | ||
| return trace.BadParameter("only EC2 resource discovery supports discovering all regions, " + | ||
| "enumerate all the regions or create a separate matcher for discovering EC2 resources") | ||
| } | ||
| break | ||
| } | ||
|
|
There was a problem hiding this comment.
I think this last option makes sense, you cannot use new features on older versions, but this should not result in the teleport process being completely broken (e.g. if I run both ssh and discovery service on the host, ssh should continue to work
| if len(m.Types) > 1 || m.Types[0] != AWSMatcherEC2 { | ||
| return trace.BadParameter("only EC2 resource discovery supports discovering all regions, " + | ||
| "enumerate all the regions or create a separate matcher for discovering EC2 resources") | ||
| } | ||
| break |
There was a problem hiding this comment.
Should we move this out of the CheckAndSetDefaults and validate this property when creating/updating the resource. This might save us a big headache when we'll need to add region discovery to other things later and older versions will get angry.
(this might not be a big issue with the cache changes anymore)
There was a problem hiding this comment.
I think I will log the failure in DiscoveryService instead, during processing.
This whole validation is very problematic 😢
I intend to change validations of other fields, so I'll get the same cache issue again.
Can I just remove the CheckAndSetDefaults when doing Un/MarshalDiscoveryConfig calls?
teleport/lib/services/discoveryconfig.go
Line 96 in 2cced62
ec9b86b to
288a3a2
Compare
288a3a2 to
7703df2
Compare
7703df2 to
1e02293
Compare
|
@marcoandredinis See the table below for backport results.
|
* AWS EC2 Discovery: discover in all enabled regions * use wildcard * require at least one region * downgrade discovery service for old clients * move region x type check to runtime * add test back * fix expected version
* AWS EC2 Discovery: discover in all enabled regions * use wildcard * require at least one region * downgrade discovery service for old clients * move region x type check to runtime * add test back * fix expected version
AWS EC2 Server discovery requires a list of regions, where the service is going to look for EC2 instances.
This PR changes the matcher so that it accepts a wildcard value for the regions field.
This will call
account.ListRegionsAWS API, which returns all the regions enabled for the current account.Docs will be updated afterwards.
Demo:
changelog: Added support for discovering EC2 instances in all regions, without enumerating them. Requires access to
account.ListRegionsin the IAM Role assumed by the Discovery Service.